Conversation
|
Can I ask if a lot of this was written by AI? I'm very surprised by a lot of the text. Also the notion that one might, as a SW developer, 'choose' between TBR vs IMR, and have code trying to determine what to pick (ref the |
|
Parts were written by AI, specifically the power consumption analyzer code as that is outside my normal wheelhouse; but looking at the references they looked solid and I edited it to make it read mostly correct to me. So, genesis by AI sure, but heavy human editing. |
|
I'm not able to give point-by-point feedback, but for example the |
|
Okay, no worries, I'll rewrite it. |
|
Probably not too useful to drop more 'random' drive-by comments like this, but for another example I think none of the use-cases mentioned for VK_EXT_shader_tile_image (bloom, edge-detection, FXAA, SSR) make sense, as the extension only gives access to the current pixel while all of these effects need access to other pixels. FWIW I've pinged some folks here at Arm to see if we can help review and support development of the guide -- I think it's a great initiative to be clear, but it probably needs some close review, especially as there is not too much good and up-to-date public info about current mobile GPUs to pull from (hence also why the idea of the guide is good, of course) :) |
|
Thanks it's MUCH appreciated. I'm by far not the best expert at TBR; and I really want to try to get updated information out there. There's a reason I read all of the research articles linked and tried to put as much research into this chapter as I could. If we could get more details and more review, I'm much happier. Soon as I get a chance, I'm going to update from the comments already generated here. |
|
The chapter title is Tile-Based Rendering Best Practices, but most of what it talks about is nothing to do with tile-based rendering but related to other aspects of vendor-specific implementation detail or orthogonal mobile GPU issues (constant registers, coherent memory, thermal, etc). For a Vulkan guide I'd probably split this up - having a topic focused only on the effects of being tile based is useful and the rest is somewhat a distraction. The most important things for tilers (good use of loadOp/storeOp) seems to be buried right at the end, and the second most important (good use of pipeline barriers to get pipelining) isn't mentioned at all. |
|
Not that much of a hardware guy, but isn't lazily allocated memory / transient attachments an important Vulkan concept for TBRs? If so it might be good to add that. |
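For readers following the thread: the lazily allocated / transient attachment pattern being suggested is a real Vulkan mechanism. A minimal sketch of how it is typically expressed, assuming a depth-only attachment that never leaves the render pass (the format, extent, and sample count here are placeholders, not taken from the guide):

```c
#include <vulkan/vulkan.h>

/* Sketch: a depth attachment that only needs to exist inside one render
 * pass. TRANSIENT_ATTACHMENT usage plus a LAZILY_ALLOCATED memory type
 * lets a tile-based GPU keep it entirely in tile memory and potentially
 * never back it with physical RAM. */
VkImageCreateInfo transientDepth = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
    .imageType = VK_IMAGE_TYPE_2D,
    .format = VK_FORMAT_D32_SFLOAT,          /* placeholder format */
    .extent = { 1920, 1080, 1 },             /* placeholder extent */
    .mipLevels = 1,
    .arrayLayers = 1,
    .samples = VK_SAMPLE_COUNT_1_BIT,
    .tiling = VK_IMAGE_TILING_OPTIMAL,
    .usage = VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT |
             VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT,
    .sharingMode = VK_SHARING_MODE_EXCLUSIVE,
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
};

/* When picking a memory type for the image, prefer one that also has: */
VkMemoryPropertyFlags wanted = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
                               VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT;
```

Note this only pays off when the attachment is paired with `loadOp`/`storeOp` settings that never force its contents into main memory.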
|
And I second the remarks about the power consumption part of that chapter. I tried to understand the code and data, but felt kinda lost. Wouldn't stuff like that require querying vendor specific apis to get real world power usage? Didn't see that mentioned anywhere. |
|
Also some of the links don't point to anything useful, e.g. these:

- Imagination PowerVR Architecture Guide: Shows tile memory providing 10-20x bandwidth compared to external memory
- Qualcomm Adreno Performance Guide: Demonstrates GMEM (tile memory) efficiency in mobile gaming scenarios
- NVIDIA Tegra TBR Analysis: Research paper showing 60% power reduction through bandwidth optimization
- IEEE Computer Graphics and Applications: Tile-Based Rendering analysis and improvements research
- IEEE Transactions on Computers: Thermal management in mobile graphics processing research

These either point to, or redirect to, a (company) landing page instead of the linked e.g. "Research papers" or documents. |
|
And other links don't make sense, e.g. this:

- Vulkan-Hpp: Modern C++ bindings with TBR optimization examples

That links to the Vulkan-Hpp headers; I don't see why or how that relates to TBR optimizations? |
|
I'm going to rewrite this. Sorry not ready for prime time. |
|
Huawei Maleoon GPU Guide: Maleoon GPU Rendering Optimization |
|
|
> - **Attachment Configuration**: Final attachments use `VK_ATTACHMENT_STORE_OP_STORE`, intermediate attachments use `VK_ATTACHMENT_STORE_OP_DONT_CARE`
> - **Load Operations**: Use `VK_ATTACHMENT_LOAD_OP_CLEAR` for new content, `VK_ATTACHMENT_LOAD_OP_DONT_CARE` for intermediate results
> - **MSAA Efficiency**: TBR handles 4x MSAA efficiently due to tile memory resolve capabilities

Not sure I'd call out 4x specifically - makes it sound like you should prefer it over 2x, or 8x for example - which I'd not say is generic advice. Though tile memory resolve can be a good source of performance gain if you are going to be using MSAA.

Removed that advice to ensure it's clear.
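For context, the store/load advice quoted above maps onto attachment descriptions roughly like this (a sketch; the formats and layouts are placeholders rather than anything from the chapter):

```c
#include <vulkan/vulkan.h>

/* Sketch: the presented color result must be stored; the depth buffer is
 * only needed while the pass runs, so it is neither loaded nor stored and
 * can live entirely in tile memory. */
VkAttachmentDescription color = {
    .format = VK_FORMAT_B8G8R8A8_UNORM,                /* placeholder */
    .samples = VK_SAMPLE_COUNT_1_BIT,
    .loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR,
    .storeOp = VK_ATTACHMENT_STORE_OP_STORE,           /* final attachment */
    .stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
    .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    .finalLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR,
};

VkAttachmentDescription depth = {
    .format = VK_FORMAT_D24_UNORM_S8_UINT,             /* placeholder */
    .samples = VK_SAMPLE_COUNT_1_BIT,
    .loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR,
    .storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,       /* intermediate only */
    .stencilLoadOp = VK_ATTACHMENT_LOAD_OP_CLEAR,
    .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
    .finalLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
};
```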
|
|
> **Tile Memory Management Strategies:**
>
> - **Memory Calculation**: Typical tile memory 512KB, calculate usage based on tile size (32x32 pixels), format size, and sample count

This calculation is not easy to perform for a developer as the determinations are not quite as simple as that. Different formats might not be stored in tile memory in the way you might naively expect. Also, how MSAA affects tile size is not widely documented.
> === Half-Precision Float Optimization
>
> Using half-precision floats in shaders can speed up execution and reduce bandwidth on mobile TBR devices. Use low-precision numbers in fragment and compute shaders when visual quality permits:

Might be worth pointing out that mediump should be checked on as many devices as possible, as they act as something of a hint - using mediump and testing only on one device that may under the hood still be using F32 can be very misleading and lead to visual issues on devices actually employing mediump.
I'd still recommend using it whenever possible, but it might be a worthwhile note/pointer.

Removed it to ensure the information I have in here is correct and something I've been able to verify. If you recommend adding it back, I'd be happy to.
> **Bandwidth Optimization Strategies:**
>
> - **Attachment configuration**: Final attachments use `VK_ATTACHMENT_STORE_OP_STORE`; intermediate attachments use `VK_ATTACHMENT_STORE_OP_DONT_CARE` when you do not need the results.
> - **Load operations**: Use `VK_ATTACHMENT_LOAD_OP_CLEAR` for new content; `VK_ATTACHMENT_LOAD_OP_DONT_CARE` for intermediate results you overwrite.

I am not sure here. I think that the loadOp can also be set to dont_care when rendering opaque objects, even for new content.
|
|
> **Advanced TBR considerations:**
>
> - Use subpasses and `VK_DEPENDENCY_BY_REGION_BIT` to enable local data reuse where beneficial; always measure on target devices.

I think it's worth mentioning here the subpassLoad operator to read the pixel value from tile memory.
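A sketch of what that suggested mention could cover: the second subpass consumes the first subpass's color output as an input attachment, the dependency is declared tile-local, and the GLSL side fetches the current pixel with `subpassLoad`. Subpass indices and bindings here are illustrative:

```c
#include <vulkan/vulkan.h>

/* Subpass 1 reads subpass 0's color output as an input attachment. */
VkAttachmentReference inputRef = {
    .attachment = 0,
    .layout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
};

VkSubpassDescription secondSubpass = {
    .pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS,
    .inputAttachmentCount = 1,
    .pInputAttachments = &inputRef,
    /* colorAttachmentCount / pColorAttachments elided */
};

/* A BY_REGION dependency tells the driver the read is tile-local, so the
 * data can stay on-chip between subpasses instead of round-tripping RAM. */
VkSubpassDependency dep = {
    .srcSubpass = 0,
    .dstSubpass = 1,
    .srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
    .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT,
};

/* GLSL in the second subpass's fragment shader:
 *   layout(input_attachment_index = 0, set = 0, binding = 0)
 *       uniform subpassInput prevColor;
 *   ...
 *   vec4 c = subpassLoad(prevColor);  // reads the current pixel only
 */
```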
|
|
> - No explicit on-chip tile memory model exposed to applications.
> - Overdraw tends to generate more external memory traffic than on tilers; minimizing overdraw is important.
> - Applications should rely on standard Vulkan techniques (early depth/stencil, appropriate load/store ops, and subpasses where helpful) and profile on target devices.

I am seeing "profile on target devices", "measure on target devices", "profiling results on target hardware" many times in this documentation. These kinds of redundant phrases should be cleaned up.
|
Currently information is scattered in various corners, and the same information appears a few times, including things like "profile on target devices" or "Tile size not exposed by core Vulkan". |
> == Understanding Tiler Architectures
>
> Mobile GPUs operate in power-constrained environments, which makes bandwidth efficiency critical.
> Since Vulkan hides many of the internal hardware details—like the exact size of a tile or how the GPU schedules work—the best way to optimize is to provide the driver with clear "intent."

Why quotes around "intent"? Doesn't seem to need them for this usage.
> Tilers usually process geometry twice: once to determine which triangles fall into which tiles (binning), and a second time to actually render the pixels.
> To speed up the binning pass, consider storing your vertex positions in a separate buffer from other attributes like UVs or normals.
> This allows the GPU to read only the data it needs to calculate tile coverage, significantly reducing unnecessary bandwidth.
> Ideally, position data should be stored as `highp` to ensure accuracy during this critical phase.

Disagree that input position needs to be "highp".
The actual calculation in the shader definitely needs to be done as fp32, but vertex position data in memory only needs to be precise enough to maintain sufficient accuracy in the model object space coordinates. Using unorm16 coordinates for a typical 10 meter real-world equivalent model gives 0.15 mm quantization accuracy, which is plenty good enough for most use cases.
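The split-stream layout under discussion might be sketched like this: binding 0 carries only positions (quantized to unorm16 along the lines the reviewer suggests, with a rescale back to object space in the vertex shader), binding 1 carries everything else. All strides, locations, and formats are illustrative choices, not from the guide:

```c
#include <vulkan/vulkan.h>

/* Binding 0: positions only, so the binning pass touches minimal data. */
VkVertexInputBindingDescription bindings[2] = {
    { .binding = 0, .stride = 4 * sizeof(uint16_t),   /* unorm16 xyz + pad */
      .inputRate = VK_VERTEX_INPUT_RATE_VERTEX },
    { .binding = 1, .stride = 8,                      /* UVs, normal, ... */
      .inputRate = VK_VERTEX_INPUT_RATE_VERTEX },
};

VkVertexInputAttributeDescription attributes[] = {
    /* Quantized position; expanded to fp32 by fixed-function vertex fetch,
       then rescaled to object-space units in the vertex shader. */
    { .location = 0, .binding = 0,
      .format = VK_FORMAT_R16G16B16A16_UNORM, .offset = 0 },
    /* Non-position attributes live in the second stream and are only
       read by the shading pass. */
    { .location = 1, .binding = 1,
      .format = VK_FORMAT_R16G16_SFLOAT, .offset = 0 },               /* UV */
    { .location = 2, .binding = 1,
      .format = VK_FORMAT_A2B10G10R10_SNORM_PACK32, .offset = 4 },    /* normal */
};
```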
|
|
> Your primary tool for controlling bandwidth is the render pass attachment configuration.
> Whether you are using traditional `VkRenderPass` objects or the modern `VK_KHR_dynamic_rendering` extension, the principles are the same.
> The `loadOp` and `storeOp` settings are not just "cleanup" steps; they are direct instructions to the hardware.
> If you know you are going to completely overwrite the tile's contents—for example, by rendering opaque geometry that covers the entire screen—you can use `VK_ATTACHMENT_LOAD_OP_DONT_CARE`.
> This tells the GPU it doesn't need to waste time loading the previous frame's data from memory OR performing a clear.
|
|
> Similarly, use `VK_ATTACHMENT_STORE_OP_DONT_CARE` for any attachment you don't need after the pass is finished.

Or use STORE_OP_NONE if you want to indicate "don't write", but also don't want to logically discard the existing content in memory.
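The distinction that comment draws could be sketched as follows. `VK_ATTACHMENT_STORE_OP_NONE` is core in Vulkan 1.3 (earlier via `VK_EXT_load_store_op_none` or `VK_KHR_dynamic_rendering`); the read-only depth scenario and format here are illustrative:

```c
#include <vulkan/vulkan.h>

/* Sketch: a pre-filled depth attachment used read-only in this pass.
 * STORE_OP_DONT_CARE would let the driver trash the depth contents in
 * memory; STORE_OP_NONE says "I will not write it, but do not logically
 * discard it either". */
VkAttachmentDescription2 depthReadOnly = {
    .sType = VK_STRUCTURE_TYPE_ATTACHMENT_DESCRIPTION_2,
    .format = VK_FORMAT_D32_SFLOAT,              /* placeholder */
    .samples = VK_SAMPLE_COUNT_1_BIT,
    .loadOp = VK_ATTACHMENT_LOAD_OP_LOAD,        /* depth from an earlier pass */
    .storeOp = VK_ATTACHMENT_STORE_OP_NONE,      /* no write, no discard */
    .stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
    .stencilStoreOp = VK_ATTACHMENT_STORE_OP_NONE,
    .initialLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL,
    .finalLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL,
};
```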
> Similarly, use `VK_ATTACHMENT_STORE_OP_DONT_CARE` for any attachment you don't need after the pass is finished.
> Depth buffers and multisampled "resolve" sources are the most common candidates here.

Mentioning resolve here is likely going to make this more confusing to read, because it immediately makes you think of resolve attachments, whereas this is about the other kind of attachment. Just "depth attachments, stencil attachments, and multisampled color attachments"?
> This extension simplifies your code by removing the need for render pass and framebuffer objects, but the hardware logic remains identical.
> You must remain disciplined about your load and store operations to avoid performance regressions.
|
|
> When using traditional render passes, try to structure them so that the driver can "merge" subpasses.

This is a bit of a throwaway paragraph that really isn't explained anywhere. Subpass merging could be worth a bit more of a treatment than this, although obviously a bit of an evolutionary dead end.
> [[overdraw-and-sorting]]
> === Managing Overdraw and Depth Logic
>
> Overdraw is one of the biggest bottlenecks on mobile GPUs.

Overdraw cost has nothing to do with being tile based.
It's a problem on less performant mobile GPUs because the GPU is less powerful, so lots of blended overdraw means you can more easily become fragment bound, but that's nothing to do with the GPU being tile based. Delete?
|
|
> Overdraw is one of the biggest bottlenecks on mobile GPUs.
> Even though writes are deferred, executing a fragment shader multiple times for the same pixel consumes valuable execution unit (EU) cycles and power.
> Sorting your opaque objects front-to-back is the most effective way to combat this.

As above, nothing to do with being tile based, so feels out of place.
> * Avoid using `discard` or writing to `gl_FragDepth` in your shaders unless absolutely necessary, as these operations can force the GPU to disable "early" depth testing and wait for the fragment shader to finish before it can determine visibility.
|
|
> [[shader-concurrency]]
> === Shader Complexity and Concurrency

Also nothing to do with being tile based. Delete?

I would personally keep this section with some changes, assuming we are focusing on TBR and mobile GPUs. However, I have no problem deleting it if we want to focus exclusively on TBR optimizations.
Complex shaders are more likely to cause register spilling on mobile GPUs, but this is somewhat orthogonal to tile-based rendering.
Note that this recommendation also applies to desktop GPUs.

Mobile optimization is a whole different topic (and huge) - the Arm best practices guide is > 100 pages because it's a hard problem. Keep this on topic; add other parallel topics on other best practices if enough vendors agree on them.

I agree it is better to keep them as separate topics, but not a strong opinion.
This version already has mobile optimizations orthogonal to TBR, so I am not sure if it is intended.
> [[precision-and-prefetch]]
> === Precision and Texture Optimization
>
> Using `mediump` (16-bit) instead of `highp` (32-bit) in your shaders is a classic mobile optimization.

Also nothing to do with being tile based. Delete?

Similar, I would keep it with some changes.
I think this is orthogonal to tilers, but it can still benefit bandwidth, so using mediump is recommended on mobile GPUs.
I personally prefer explicit types such as float16.
Note that some desktop GPUs can also benefit from float16.

Keep it. As a different topic parallel to this one.
> To keep the hardware busy, you want these two stages to overlap as much as possible—the GPU should be binning the next frame while it is still shading the current one.
>
> Incorrect use of pipeline barriers can break this overlap.
> If you use a barrier that is too broad—like `VK_PIPELINE_STAGE_ALL_COMMANDS_BIT`—you might force the GPU to finish all pending fragment work before it can even start the binning pass for the next set of draws.

Not sure how this renders - is this missing spaces around the "—" in "broad—like", etc.? (Occurs in other places in the doc too.)
|
A bit orthogonal to tilers, but it might be useful to explain that compute shaders can have lower effective bandwidth compared to the vertex and fragment stages on mobile GPUs. Developers may want to move work to fragment and avoid enabling |
> This extension provides improved robustness when dangerous undefined behavior occurs, such as out-of-bounds array access. This is particularly important for TBR architectures where tile memory constraints can make buffer overruns more problematic.
>
> **Mobile developer guidance:**
> Mobile developers are strongly encouraged to use VK_EXT_robustness2 when targeting TBR GPUs, as tile memory constraints make out-of-bounds access more likely to cause visible artifacts or crashes.

No.
Mobile developers targeting Mali GPUs are encouraged not to use bounds checking.
Robust buffer access is a debugging feature, and we recommend it be temporarily enabled to investigate application crashes or visual artifacts. Enabling it in production will negatively impact performance.
Enabling bounds checking causes a loss in performance for accesses to uniform buffers and shader storage buffers.
> * **IMR GPUs** typically process triangles and write the resulting fragments to memory almost immediately. They rely on high-bandwidth memory and large caches to handle the traffic. Overdraw on an IMR is expensive because every pixel written potentially triggers a memory write.
> * **TBR GPUs** defer those writes. By "binning" the geometry and processing by tile, they can perform many operations—like blending and depth testing—entirely within the tile memory. The memory write only happens once the tile is finished.
>
> While you shouldn't try to build a renderer that switches between TBR and IMR logic at runtime, understanding the difference helps you write code that is efficient for both. Good attachment management and avoiding unnecessary overdraw benefit every architecture, but they are absolutely essential for performance on a tiler.

Nitpick:
I am not sure what "build a renderer that switches between TBR and IMR logic at runtime" means.
Note that some GPUs can change between IMR and TBR.
I would change the sentence to something like:
"It is not necessary to write a custom path for TBR and IMR GPUs. In general, understanding how a tiler works will help you write code that is efficient for both architectures."
|
|
> Overdraw is one of the biggest bottlenecks on mobile GPUs.
> Even though writes are deferred, executing a fragment shader multiple times for the same pixel consumes valuable execution unit (EU) cycles and power.
> Sorting your opaque objects front-to-back is the most effective way to combat this.

"Sorting your opaque objects front-to-back is the most effective way to combat this."
This was recommended on mobile for longer than on PC. While it is relevant for older GPUs, it is no longer always recommended for newer architectures.
> [[synchronization-and-subpasses]]
> === Synchronization and Pipeline Flow
>
> Frequent synchronization points—like calling `vkQueueWaitIdle`—can cause the GPU to stall while waiting for the CPU, or vice versa.

Is this tiler specific? I think it is a best practice for all GPUs?
> Incorrect use of pipeline barriers can break this overlap.
> If you use a barrier that is too broad—like `VK_PIPELINE_STAGE_ALL_COMMANDS_BIT`—you might force the GPU to finish all pending fragment work before it can even start the binning pass for the next set of draws.
> Instead, use the most specific stages and access masks possible.
> For example, if a compute shader produces data for a vertex buffer, the barrier should only synchronize the compute stage with the vertex input stage.

I would mention that ALL_GRAPHICS–ALL_GRAPHICS should generally be avoided. It's better to separate VERTEX_SHADER_BIT and FRAGMENT_SHADER_BIT, and mark resources according to the stages that actually use them.
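The compute-to-vertex-input example quoted above might look like this with synchronization2 (a sketch; the buffer handle is a placeholder, and the actual command buffer recording is elided):

```c
#include <vulkan/vulkan.h>

VkBuffer vertexBuffer = VK_NULL_HANDLE;  /* placeholder handle */

/* Sketch: compute writes a vertex buffer that a later draw consumes.
 * Only COMPUTE -> VERTEX_ATTRIBUTE_INPUT is synchronized, so fragment
 * work from earlier passes keeps running and the tiler keeps pipelining. */
VkBufferMemoryBarrier2 barrier = {
    .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER_2,
    .srcStageMask = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT,
    .srcAccessMask = VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_2_VERTEX_ATTRIBUTE_INPUT_BIT,
    .dstAccessMask = VK_ACCESS_2_VERTEX_ATTRIBUTE_READ_BIT,
    .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .buffer = vertexBuffer,
    .offset = 0,
    .size = VK_WHOLE_SIZE,
};

VkDependencyInfo depInfo = {
    .sType = VK_STRUCTURE_TYPE_DEPENDENCY_INFO,
    .bufferMemoryBarrierCount = 1,
    .pBufferMemoryBarriers = &barrier,
};
/* Recorded between the dispatch and the draw:
 *   vkCmdPipelineBarrier2(cmd, &depInfo); */
```

Nothing here touches fragment stages, which is exactly the point: broad `ALL_COMMANDS`/`ALL_GRAPHICS` masks would serialize the vertex/binning work of the next pass against all outstanding fragment work.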
NB, fix the TBR link to the Simple Game Engine tutorial when it is published.